Using cache and bloom filters for URL lookups

ABSTRACT

Enforcing a policy based at least in part on URL information is disclosed. A request to access a first uniform resource locator (URL) is received from a client device. A portion of the first URL, or a transformation thereof, is matched against a bloom filter. A first query is performed using the matched portion of the received first URL, and, in response to receiving a “no match” response to the first query, a second query that is different from the first query is performed. A policy is enforced based at least in part on a category received as a result of a second query.

CROSS REFERENCE TO OTHER APPLICATIONS

This application is a continuation of co-pending U.S. patent application Ser. No. 13/111,131 entitled USING CACHE AND BLOOM FILTERS FOR URL LOOKUPS filed May 19, 2011 which is incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

Firewalls and other security devices typically enforce policies against network transmissions based on a set of rules. In some cases, the rules may be based on uniform resource locator (URL) information, such as by preventing a user from accessing a specific URL (e.g., denying access to http://www.example.com), or by preventing a user from accessing a category of the URL (e.g., denying access to sites classified as “social networking” sites or “pornographic” sites). Unfortunately, given the sheer volume of URLs in existence, it can be difficult to efficiently match rules that make use of URL information.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 illustrates an embodiment of an environment in which policies that include URL information are enforced.

FIG. 2 illustrates an embodiment of a policy enforcement appliance.

FIG. 3 illustrates an embodiment of a policy enforcement appliance.

FIG. 4A illustrates an example of a URL.

FIG. 4B illustrates a portion of a URL.

FIG. 4C illustrates a portion of a URL.

FIG. 4D illustrates a portion of a URL.

FIG. 5A illustrates a representation of processing performed by a policy enforcement appliance in some embodiments.

FIG. 5B illustrates a representation of processing performed by a policy enforcement appliance in some embodiments.

FIG. 5C illustrates a representation of processing performed by a policy enforcement appliance in some embodiments.

FIG. 6 illustrates an embodiment of a process for enforcing a policy based at least in part on URL information.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

FIG. 1 illustrates an embodiment of an environment in which policies that include URL information are enforced. In the example shown, clients 104 and 106 are a laptop computer and desktop computer, respectively, present in an enterprise network 108. Policy enforcement appliance 102 (also referred to herein as “appliance 102”) is configured to enforce policies regarding communications between clients, such as clients 104 and 106, and nodes outside of enterprise network 108 (e.g., reachable via external network 110). One example of a policy is a rule prohibiting any access to site 112 (a pornographic website) by any client inside network 108. Another example of a policy is a rule prohibiting access to social networking site 114 by clients between the hours of 9 am and 6 pm. Yet another example of a policy is a rule allowing access to streaming video website 116, subject to a bandwidth or other consumption constraint. Other types of policies can also be enforced, such as ones governing traffic shaping, quality of service, or routing with respect to a given URL, pattern of URLs, category of URL, or other URL information. In some embodiments, policy enforcement appliance 102 is also configured to enforce policies with respect to traffic that stays within enterprise network 108.

The functionality provided by policy enforcement appliance 102 can be implemented in a variety of ways. Specifically, policy enforcement appliance 102 can be a dedicated device or set of devices. The functionality provided by appliance 102 can also be integrated into or executed as software on a general purpose computer, a computer server, a gateway, and/or a network/routing device. Further, whenever appliance 102 is described as performing a task, a single component, a subset of components, or all components of appliance 102 may cooperate to perform the task. Similarly, whenever a component of appliance 102 is described as performing a task, a subcomponent may perform the task and/or the component may perform the task in conjunction with other components. In various embodiments, portions of appliance 102 are provided by one or more third parties. Depending on factors such as the amount of computing resources available to appliance 102, various logical components and/or features of appliance 102 may be omitted and the techniques described herein adapted accordingly. Similarly, additional logical components/features can be added to appliance 102 as applicable. As one example, multiple bloom filters may be included.

FIG. 2 illustrates an embodiment of a policy enforcement appliance. The example shown is a representation of physical components that are included in appliance 102, in some embodiments. Specifically, appliance 102 includes a high performance multi-core CPU 202 and RAM 204. Appliance 102 also includes a storage 210 (such as one or more hard disks), which is used to store policy and other configuration information, as well as URL information. Appliance 102 can also include one or more optional hardware accelerators. For example, appliance 102 can include a cryptographic engine 206 configured to perform encryption and decryption operations, and one or more FPGAs 208 configured to perform matching, act as network processors, and/or perform other tasks.

FIG. 3 illustrates an embodiment of a policy enforcement appliance. In the example shown, the functionality of policy enforcement appliance 102 is implemented in a firewall. Specifically, appliance 102 includes a management plane 302 and a data plane 304. The management plane is responsible for managing user interactions, such as by providing a user interface for configuring policies (318) and viewing log data. The data plane is responsible for managing data, such as by performing packet processing (e.g., to extract URLs) and session handling. In various embodiments, a scheduler is responsible for managing the scheduling of requests (e.g., as presented by data plane 304 to management plane 302, or as presented by management plane 302 to URL server 316).

One task performed by the firewall is URL filtering. Suppose network 108 belongs to a company, “ACME Corporation.” Specified in appliance 102 are a set of policies 318, some of which govern the types of websites that employees may access, and under what conditions. As one example, included in the firewall is a policy that permits employees to access news-related websites. Another policy included in the firewall prohibits, at all times, employees from accessing pornographic websites. Also included in the firewall is a database of URLs and associated categories. Other information can also be associated with the URLs in the database instead of or in addition to category information, and that other information can be used in conjunction with policy enforcement.
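As an illustration only, category-based policies of the kind described above might be represented as a small ordered rule list. The sketch below is not taken from the specification: the `Rule` class, the `decide` function, the category strings, and the default action of "allow" are all assumptions made for the example. The time-window rule mirrors the 9 am to 6 pm social networking example given earlier.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical policy rule keyed on a URL category."""
    category: str
    action: str                     # "allow" or "block"
    start: Optional[time] = None    # if set with end, rule applies in [start, end)
    end: Optional[time] = None

    def applies(self, category: str, now: time) -> bool:
        if category != self.category:
            return False
        if self.start is None or self.end is None:
            return True             # no time window: rule applies at all times
        return self.start <= now < self.end


def decide(rules, category: str, now: time, default: str = "allow") -> str:
    """Return the action of the first applicable rule, else the assumed default."""
    for rule in rules:
        if rule.applies(category, now):
            return rule.action
    return default


# Example rule set (category names are illustrative, not from the source).
POLICIES = [
    Rule("PORNOGRAPHY", "block"),
    Rule("SOCIAL-NETWORKING", "block", time(9, 0), time(18, 0)),
    Rule("NEWS", "allow"),
]
```

First-match-wins ordering is one simple design choice; a real appliance could equally use priorities or most-specific-match semantics.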

In some embodiments, the database is provided by a third party, such as through a subscription service. In such a scenario, it is possible that instead of the URLs being directly stored in database 312, a transformation is applied to the URLs prior to storage. As one example, MD5 hashes of URLs can be stored in database 312, rather than the URLs themselves. The URLs stored in database 312 (or transformations thereof) represent the top n URLs for which access is most likely to be sought by users of client devices, such as client 104, where n can be configured based on the computing and other resources available to appliance 102. As one example, database 312 includes 20 million URLs and is stored in storage 210. A bloom filter 308 is compiled from the contents of database 312 and is loaded into RAM 204. In some embodiments, the bloom filter is compiled as a bitmask. Whenever changes are made to database 312 (e.g., as an update provided by a vendor), bloom filter 308 is recompiled. Also included in the firewall are various caches 306, 310, and 314, also loaded into RAM 204. In some embodiments, all or some of caches 306, 310, and 314 are omitted from appliance 102 and the processing described herein is adapted accordingly. Additional detail regarding components shown in FIG. 3 will be provided below.
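A minimal sketch of compiling a bloom filter, stored as a bitmask, from a database keyed by MD5 hashes of URLs is shown below. The specification does not state how bit positions are derived; the double-hashing scheme over a SHA-256 digest, the class and function names, and the tiny filter size are assumptions chosen only to keep the example short.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter backed by a bytearray used as a bitmask."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Assumption of this sketch: derive k positions by double hashing
        # over two halves of a single SHA-256 digest of the key.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False corresponds to REJECT (definitely absent);
        # True corresponds to ACCEPT (present, or a false positive).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def md5_key(url: str) -> bytes:
    """Transformation applied to a URL before storage or lookup."""
    return hashlib.md5(url.encode("utf-8")).digest()


def compile_filter(database: dict, num_bits: int = 1024, num_hashes: int = 3) -> BloomFilter:
    """Rebuild the filter from scratch from the database keys."""
    bloom = BloomFilter(num_bits, num_hashes)
    for key in database:
        bloom.add(key)
    return bloom


# Hypothetical database contents: MD5(URL) -> category.
database = {
    md5_key("news.example.com"): "NEWS",
    md5_key("social.example.net"): "SOCIAL-NETWORKING",
}
bloom = compile_filter(database)
```

Because a plain bloom filter does not support deletion, rebuilding it whenever the database changes, as the text describes, is the straightforward way to keep the two consistent.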

When a user of client 104 (an employee referred to herein as “Alice”) attempts to engage in activities such as web surfing, communications from and to the client pass through policy enforcement appliance 102. As one example, suppose Alice has launched a web browser application on client 104 and would like to visit an arbitrary web page. Appliance 102 is configured to evaluate the URL of the site Alice would like to visit and determine whether access should be permitted.

FIG. 4A illustrates an example of a URL (402) and FIGS. 4B-4D illustrate portions of URL 402. In particular, FIG. 4B illustrates URL 402 up through the first subpath, FIG. 4C illustrates the hostname portion of URL 402, and FIG. 4D illustrates the domain portion of URL 402. Portions 404-408 are also referred to herein as “URLs 404-408.” In some embodiments, in the processing described in more detail below, a match against the most specific portion of URL 402 (e.g., URL 404) will be first attempted, with fallbacks to more generalized versions of the URL (e.g., URLs 406 and 408, respectively).
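The derivation of progressively more general portions of a URL can be sketched as follows. The example URL and the function name `candidate_urls` are hypothetical, and taking the last two hostname labels as the "domain" is a simplifying assumption; a production system would likely need a public suffix list to handle hostnames such as those under `co.uk`.

```python
from urllib.parse import urlsplit


def candidate_urls(url: str) -> list[str]:
    """Return lookup candidates ordered from most to least specific."""
    # urlsplit only recognizes the host when a scheme or "//" prefix is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]

    candidates = []
    if segments:
        candidates.append(f"{host}/{segments[0]}")   # up through the first subpath
    candidates.append(host)                          # hostname portion
    domain = ".".join(host.split(".")[-2:])          # naive domain portion (assumption)
    if domain != host:
        candidates.append(domain)
    return candidates


portions = candidate_urls("http://www.news.example.com/california/index.html")
```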

Suppose Alice would like to visit URL 402 (the California-specific front page of an online news service) and enters that URL into her browser. In some embodiments, the URL is evaluated by appliance 102 as follows. In the first stage of the evaluation, the data plane consults cache 306 for the presence of each of URLs 404, 406, and 408, in order, until a match is found. If one of the URLs is present, the associated category that is also stored in cache 306 is used to enforce any applicable policies 318. If none of the URLs are present in cache 306, a temporary entry is inserted into cache 306 indicating that the URL is being resolved. As one example, a URL being resolved is assigned a temporary category of “UNRESOLVED.” In some embodiments, an entry for each of URLs 404-408 (and a corresponding status of “UNRESOLVED”) is included in cache 306. In other embodiments, only one entry is made, such as an entry for URL 404. Additional requests received by appliance 102 for access to URL 402 (or portions thereof) will be queued pending the resolution. In various embodiments, a timeout condition is placed on UNRESOLVED entries included in cache 306, such that if the entry is not updated within a specified period of time, the entry is removed.
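The first-stage cache behavior described above, including the temporary UNRESOLVED entry and its timeout, could be sketched along these lines. The class name, method names, and the five-second default are assumptions for illustration; the queueing of additional requests while an entry is UNRESOLVED is left out for brevity.

```python
import time

UNRESOLVED = "UNRESOLVED"


class DataPlaneCache:
    """Hypothetical stand-in for cache 306 with expiring UNRESOLVED entries."""

    def __init__(self, unresolved_ttl: float = 5.0, clock=time.monotonic) -> None:
        self._entries = {}          # url -> (category, inserted_at)
        self._ttl = unresolved_ttl
        self._clock = clock         # injectable so the timeout can be tested

    def lookup(self, candidates):
        """Return the category of the first candidate present, else None."""
        now = self._clock()
        for url in candidates:
            entry = self._entries.get(url)
            if entry is None:
                continue
            category, inserted_at = entry
            if category == UNRESOLVED and now - inserted_at > self._ttl:
                del self._entries[url]      # placeholder timed out: remove it
                continue
            return category
        return None

    def mark_unresolved(self, url: str) -> None:
        self._entries[url] = (UNRESOLVED, self._clock())

    def resolve(self, url: str, category: str) -> None:
        self._entries[url] = (category, self._clock())
```

This variant inserts a single placeholder per request, matching the embodiment in which only one entry (such as the one for URL 404) is made.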

Assuming the URL remains unresolved, the data plane sends a request to the management plane for evaluation of the URL. The next stage of evaluation is for the management plane to perform a match against bloom filter 308. URL 404 is checked first, as follows: URL 404 is transformed as applicable (e.g., an MD5 hash of URL 404 is computed). For the remainder of the discussion of this example, no distinction will be made between the URL and the MD5 (or other transformation) of the URL, to aid in clarity. It is to be assumed that if database 312 stores MD5 hashes, the queries performed against it (and the corresponding bloom filter and queries against the bloom filter) will be performed using MD5 (or other applicable) transformations of URLs.

A REJECT response, if received from bloom filter 308 for URL 404, indicates with 100% confidence that URL 404 is not present in database 312. An ACCEPT response indicates that URL 404 is present in database 312, subject to a given false positive rate. The desired false positive rate of bloom filter 308 is configurable and is in some embodiments set at 10%, meaning that an ACCEPT response indicates, with 90% confidence, that the URL is present in database 312. Additional detail of how elements 308, 310, and 312 are used to process URLs is provided with reference to FIGS. 5A-5C.
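To give a sense of what a configurable false positive rate implies for memory, the sketch below applies the standard textbook approximations for an optimally configured bloom filter: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. These formulas are not stated in the specification, and the function name is hypothetical; the inputs reuse the 20 million entry and 10% figures mentioned in the text.

```python
import math


def bloom_parameters(num_items: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) from the standard sizing approximations."""
    bits = math.ceil(-num_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_items * math.log(2)))
    return bits, hashes


bits, hashes = bloom_parameters(20_000_000, 0.10)
mebibytes = bits / 8 / 2**20
print(f"{bits} bits (about {mebibytes:.1f} MiB), {hashes} hash functions")
```

Under these assumptions the filter needs a little under five bits per entry, which is small enough to hold in RAM even when the database itself resides on disk.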

FIGS. 5A-5C illustrate representations of processing performed by a policy enforcement appliance in some embodiments. In the examples shown, assume URL 408 is present in database 312 (i.e., an MD5 hash of URL 408 is present), while URLs 404 and 406 are not. Further, assume that bloom filter 308 will indicate a false positive for URL 404. First, a match is performed using URL 404 (502). Bloom filter 308 reports an ACCEPT (504), meaning that there is a 90% chance that URL 404 is present in database 312. Cache 310 is evaluated for the presence of URL 404 (506). URL 404 is not present in the cache (508), and so a query of database 312 is performed using URL 404 (510). As mentioned above, the acceptance of URL 404 by the bloom filter was a false positive. URL 404 is not present in database 312. Accordingly, the query of database 312 for URL 404 will also fail (512). Next, a match against bloom filter 308 for URL 406 is performed (532). The bloom filter reports a REJECT (534), indicating with 100% confidence that the URL is not present in database 312. There is accordingly no need to perform lookups against cache 310 or database 312 using URL 406. Finally, a match against bloom filter 308 for URL 408 is performed (572). The bloom filter reports an ACCEPT (574), meaning that there is a 90% chance that URL 408 is present in database 312. Cache 310 is evaluated for the presence of URL 408 (576). URL 408 is not present in the cache (578), and so a query of database 312 is performed using URL 408 (580). In this case, URL 408 is present in database 312 and so the corresponding category NEWS is returned (582) and ultimately provided to data plane 304, which will update the entry in cache 306 by changing the UNRESOLVED category to NEWS. In some embodiments, only the finally matched URL (408) is updated in cache 306. In other embodiments, entries for each of URLs 404, 406, and 408 are updated in cache 306 with a NEWS category. The category will be used by the firewall to enforce any applicable rules. In this case, for example, Alice's attempt to access URL 402 with her browser will be allowed, because her request has been associated with an attempt to access a NEWS site, which is a permissible use. Cache 310 is also updated to include the returned category and URL 408 (i.e., its MD5 hash). In some embodiments, cache 310 is also updated when result 512 is returned. In that case, URL 404 is included in cache 310 along with a category of UNKNOWN. In various embodiments, when result 582 is returned, the UNKNOWN category included in cache 310 for URL 404 is modified to match the result.
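The sequence walked through for FIGS. 5A-5C can be condensed into a short sketch. Everything here is a stand-in: the bloom filter is replaced by a set that deliberately includes a false positive, the database is a dictionary, and the URL strings and function names are hypothetical. The variant shown caches only successful database results, which corresponds to one of the embodiments described.

```python
def management_plane_lookup(candidates, bloom_accepts, cache_310, query_database):
    """Return (matched_url, category), or (None, None) if nothing matches."""
    for url in candidates:                       # most specific portion first
        if not bloom_accepts(url):
            continue                             # REJECT: skip cache and database
        category = cache_310.get(url)
        if category is None:
            category = query_database(url)       # fails on a bloom false positive
            if category is None:
                continue
            cache_310[url] = category            # remember the database result
        return url, category
    return None, None


# Stand-ins reproducing the scenario above (all values hypothetical).
URL_404 = "www.example.com/california"
URL_406 = "www.example.com"
URL_408 = "example.com"
database_312 = {URL_408: "NEWS"}
bloom_says_accept = {URL_404, URL_408}           # URL_404 is the false positive
db_queries = []                                  # records which URLs hit the database


def query_database(url):
    db_queries.append(url)
    return database_312.get(url)


cache_310 = {}
result = management_plane_lookup(
    [URL_404, URL_406, URL_408],
    lambda url: url in bloom_says_accept,
    cache_310,
    query_database,
)
```

In this run the database is queried for the first and third candidates only; the second is skipped entirely because the filter rejects it, which is the saving the bloom filter is intended to provide.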

Returning to the description of FIG. 3, assume that none of URLs 404-408 are present in database 312. The next phase of evaluation performed by the management plane would be to consult cache 314 to see if any of the URLs are present therein. As with the previous phases, if one of the URLs is present, the corresponding category (e.g., “NEWS”) will be returned as a result and can be used by the firewall in policy enforcement (and included in cache 306). If the URLs are also absent from cache 314, one or more remote URL servers, such as URL server 316, are queried. In some embodiments, URL server 316 is made available by the provider of the contents of database 312, and contains URL information that supplements the information included in database 312 (e.g., by including many millions of additional URLs and corresponding categories). URL server 316 can also be under the control of the owner of appliance 102 or any other appropriate party. In various embodiments, a bloom filter corresponding to the data stored by URL server 316 is included in appliance 102.

In the event that URLs 404-408 are also absent from URL server 316, a category of UNKNOWN will be returned and appropriate policies applied, based on the category, such as by blocking access to URL 402. Cache 306 can also be updated by switching the temporary category of UNRESOLVED to UNKNOWN. As with cache 310, cache 314 is updated based on results returned by URL server 316. In some embodiments, URLs with UNKNOWN categorization have a timeout, thus allowing for resolution of the categorization during a subsequent request.
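The fallback from cache 314 to a remote URL server, ending in an UNKNOWN category, might look roughly like the following. The function name `resolve_remote` and the `query_url_server` callable are hypothetical; in practice the latter would be a network request, which is replaced here by a dictionary lookup so the sketch stays self-contained.

```python
UNKNOWN = "UNKNOWN"


def resolve_remote(candidates, cache_314, query_url_server):
    """Consult cache 314 first, then the remote server; fall back to UNKNOWN."""
    for url in candidates:
        if url in cache_314:
            return cache_314[url]
    for url in candidates:
        category = query_url_server(url)     # stand-in for a call to URL server 316
        if category is not None:
            cache_314[url] = category        # cache the server's answer
            return category
    return UNKNOWN


# Hypothetical remote data, standing in for URL server 316.
remote_categories = {"example.com": "NEWS"}
cache_314 = {}
category = resolve_remote(
    ["a.example.com/x", "a.example.com", "example.com"],
    cache_314,
    remote_categories.get,
)
```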

FIG. 6 illustrates an embodiment of a process for enforcing a policy based at least in part on URL information. In some embodiments, the process shown in FIG. 6 is performed by policy enforcement appliance 102 and, in various embodiments, multiple instances of the process shown in FIG. 6 or portions thereof are performed in parallel on appliance 102, as applicable. The process begins at 602 when a URL is received. As one example, at 602 a URL is received when data plane 304 extracts a URL out of a packet received from client 104. At 604, the URL is matched against a bloom filter. In various embodiments, what is matched is a portion of the URL (e.g., portions 404-408 of URL 402), and/or a transformation of the URL (e.g., an MD5 hash of the URL or URL portion). As one example of the processing performed at 604, URL 404 is matched against bloom filter 308 (as illustrated in FIG. 5A at 502). As an alternate example of the processing performed at 604, URL 408 is matched against bloom filter 308 (as illustrated in FIG. 5C at 572). At 606, a first query is performed, based on a result of the match. As one example of the processing performed at 606, query 510 is performed. As an alternate example of the processing performed at 606, query 580 is performed. Finally, at 608, a policy is enforced based at least in part on a category received as a result of a second query. As one example of the processing performed at 608, a policy is enforced based on the receipt of the “NEWS” category. In some cases, the first and second query may be different (e.g., where the first query is query 576 and the second query is query 580; where the first query is query 506 or query 510 and the second query is query 580; or where the first query is performed against database 312 and the second query is performed against cache 314 or remote URL server 316). In some cases, such as where cache 310 is omitted, the first and second query may be the same (e.g., where the first and second queries are both query 580).

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

What is claimed is:
1. A system, comprising: a processor configured to: receive, from a client device, a request to access a first uniform resource locator (URL); match at least a portion of the received first URL against a bloom filter; in response to receiving an “accept” as a result of the match against the bloom filter, perform a first query using the matched portion of the received first URL, and in response to receiving a “no match” response to the first query, perform a second query that is different from the first query; and based at least in part on a category received as a result of a second query, enforce a policy with respect to the request to access the received first URL; and a memory coupled to the processor and configured to provide the processor with instructions.
2. The system of claim 1 wherein the content of the first and second query is the same, and wherein the first and second query are submitted to two different entities.
3. The system of claim 1 wherein the first query is performed against a cache associated with a database associated with the bloom filter.
4. The system of claim 1 wherein the first query is performed against a database associated with the bloom filter.
5. The system of claim 1 wherein the second query is a shortened version of the first query.
6. The system of claim 1 wherein the second query is performed against a database that is not associated with the bloom filter.
7. The system of claim 1 wherein the second query is performed against a cache that is not associated with a database associated with the bloom filter.
8. The system of claim 1 wherein at least one of the first query and the second query includes an MD5 hash of a representation of at least a portion of the first URL.
9. The system of claim 1 wherein the processor is further configured to rewrite the first URL.
10. The system of claim 9 wherein the processor is configured to match the rewritten URL against the bloom filter.
11. The system of claim 1 wherein the processor is further configured to associate the received category with the first URL in a cache.
12. The system of claim 1 wherein matching comprises matching a transformation of the first URL against the bloom filter.
13. The system of claim 12 wherein the transformation includes an MD5 hash.
14. A method, comprising: receiving, from a client device, a request to access a first uniform resource locator (URL); matching at least a portion of the received first URL against a bloom filter, using a processor; in response to receiving an “accept” as a result of the match against the bloom filter, performing a first query using the matched portion of the received first URL, and in response to receiving a “no match” response to the first query, performing a second query that is different from the first query; and based at least in part on a category received as a result of a second query, enforcing a policy with respect to the request to access the received first URL.
15. The method of claim 14 wherein the first query is performed against a database associated with the bloom filter.
16. The method of claim 14 wherein the second query is a shortened version of the first query.
17. The method of claim 14 wherein the second query is performed against a database that is not associated with the bloom filter.
18. The method of claim 14 further comprising rewriting the first URL.
19. The method of claim 14 further comprising associating the received category with the first URL in a cache.
20. A computer program product embodied in a non-transitory computer readable storage medium and comprising computer instructions for: receiving, from a client device, a request to access a first uniform resource locator (URL); matching at least a portion of the received first URL against a bloom filter; in response to receiving an “accept” as a result of the match against the bloom filter, performing a first query using the matched portion of the received first URL, and in response to receiving a “no match” response to the first query, performing a second query that is different from the first query; and based at least in part on a category received as a result of a second query, enforcing a policy with respect to the request to access the received first URL.